Aiming at a holistic understanding of multiple downstream tasks simultaneously, there is a need to extract features with better transferability. Although many recent self-supervised pre-training methods have achieved impressive performance on various vision tasks under the prevailing pretrain-finetune paradigm, their generalization capacity to multi-task learning scenarios remains to be explored. In this paper, we extensively investigate the transfer performance of various types of self-supervised methods, e.g., MoCo and SimCLR, on three downstream tasks, including semantic segmentation, drivable area segmentation, and traffic object detection, on the large-scale driving dataset BDD100K. We surprisingly find that their performance is sub-optimal or even lags far behind the single-task baselines, which may be due to the distinctions in training objectives and architectural designs embedded in the pretrain-finetune paradigm. To overcome this dilemma and avoid redesigning the resource-intensive pre-training stage, we propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training, in which off-the-shelf pretrained models can be effectively adapted without increasing the training overhead. During the adapt stage, we utilize learnable multi-scale adapters to dynamically adjust the pretrained model weights under the supervision of multi-task objectives, while keeping the pretrained knowledge untouched. Furthermore, we regard the vision-language pre-training model CLIP as a strong complement to the pretrain-adapt-finetune paradigm and propose a novel adapter named LV-Adapter, which incorporates language priors into the multi-task model via task-specific prompting and alignment between visual and textual features.
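As a rough illustration of the adapt stage, the sketch below adds a generic bottleneck adapter on top of a frozen pretrained encoder; the module layout, reduction ratio, and residual connection are illustrative assumptions and do not reproduce the paper's multi-scale LV-Adapter design.

```python
# Hypothetical sketch: a bottleneck adapter trained on top of a frozen backbone.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Down-project, nonlinearity, up-project, residual add; only these weights train."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.act = nn.GELU()
        self.up = nn.Linear(dim // reduction, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the pretrained features untouched.
        return x + self.up(self.act(self.down(x)))

backbone = nn.Identity()  # stand-in for a pretrained encoder
for p in backbone.parameters():
    p.requires_grad = False  # pretrained knowledge stays frozen
adapter = Adapter(dim=256)
feats = torch.randn(8, 196, 256)      # (batch, tokens, channels)
adapted = adapter(backbone(feats))    # only adapter weights receive gradients
```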
Vision transformers have achieved state-of-the-art performance on many vision tasks. Due to the quadratic computational and memory complexity of self-attention, recent works either apply attention only to low-resolution inputs or restrict the receptive field to a small local region. To overcome these limitations, we propose key-only attention, which excludes query-key pairwise interactions and uses a computationally efficient saliency gate to obtain attention weights, modeling local-global interactions at all stages. Key-only attention has linear computational and memory complexity w.r.t. the input size. Instead of the grafting suggested by previous works, we use an alternating layout to hybridize convolution and attention layers, so that all stages benefit from both spatial attention and convolution. We leverage these improvements to develop a new family of self-attention models, Linglos, which reaches state-of-the-art accuracy in the parameter-limited setting of the ImageNet classification benchmark and significantly outperforms baselines on downstream tasks, such as COCO object detection and ADE20K semantic segmentation.
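A minimal sketch of what a key-only attention layer with a saliency gate could look like is given below; the scalar gating head and the pooled-context residual are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch of key-only attention: no query-key dot products, so the cost
# is linear in the number of tokens.
import torch
import torch.nn as nn

class KeyOnlyAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.saliency = nn.Linear(dim, 1)  # one gating score per token

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k, v = self.key(x), self.value(x)            # (B, N, D)
        w = torch.softmax(self.saliency(k), dim=1)   # (B, N, 1) attention over tokens
        ctx = (w * v).sum(dim=1, keepdim=True)       # (B, 1, D) global context
        return x + ctx.expand_as(x)                  # broadcast context back to tokens

out = KeyOnlyAttention(dim=64)(torch.randn(2, 49, 64))  # O(N) in the 49 tokens
```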
Ultrasound (US) imaging is commonly used to assist in the diagnosis of and interventions for spine diseases, while performing standardized US acquisitions by manually operating the probe requires substantial experience and training of sonographers. In this work, we propose a novel dual-agent framework that integrates a reinforcement learning (RL) agent and a deep learning (DL) agent to jointly determine the movement of the US probe based on real-time US images, in order to mimic the decision-making process of an expert sonographer and achieve autonomous standard view acquisition in spinal sonography. Moreover, inspired by the nature of US propagation and the characteristics of the spinal anatomy, we introduce a view-specific acoustic shadow reward that utilizes shadow information to implicitly guide the navigation of the probe toward different standard views of the spine. Our method is validated in quantitative and qualitative experiments in a simulation environment built with US data acquired from 17 volunteers. The average navigation accuracy toward different standard views reaches 5.18 mm / 5.25° and 12.87 mm / 17.49° in the intra- and inter-subject settings, respectively. The results demonstrate that our method can effectively interpret US images and navigate the probe to acquire multiple standard views of the spine.
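As a toy illustration of the view-specific acoustic shadow reward, the sketch below scores the overlap between a crudely thresholded shadow region and an assumed view-specific template mask; the paper's actual reward formulation may differ substantially.

```python
import numpy as np

def shadow_reward(us_image: np.ndarray, view_template: np.ndarray, lam: float = 0.5) -> float:
    """Toy reward: treat dark pixels as acoustic shadow and return the scaled IoU
    between the shadow mask and a (hypothetical) template for the target view."""
    shadow = us_image < 0.3 * us_image.mean()        # crude shadow segmentation
    inter = np.logical_and(shadow, view_template).sum()
    union = np.logical_or(shadow, view_template).sum() + 1e-8
    return lam * float(inter / union)
```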
Simultaneous magnetic actuation and localization (SMAL) for active wireless capsule endoscopy (WCE) has been intensively studied in recent years to improve the efficiency and accuracy of examination. In this paper, we propose an autonomous magnetic navigation framework for active WCE that mimics the "insertion" and "withdrawal" procedures performed by expert physicians in conventional colonoscopy, enabling the robotic capsule endoscope to conduct an efficient and accurate examination of the intestine with minimal user effort. First, the capsule is automatically propelled through the unknown intestinal environment, generating a viable path that represents the environment. Then, the capsule is autonomously navigated toward any point selected on the intestinal trajectory, allowing accurate and repeated inspection of suspicious lesions. Moreover, we implement the navigation framework on a robotic system incorporating advanced SMAL algorithms and validate it by navigating various tubular environments, including phantoms and an ex-vivo pig colon. Our results demonstrate that the proposed autonomous navigation framework can effectively navigate the capsule in unknown, complex tubular environments with satisfactory accuracy, repeatability, and efficiency compared with manual operation.
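As a small illustration of the "withdrawal" phase, the sketch below picks the point on the recorded insertion path closest to a user-selected lesion location; the path representation and the selection rule are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def nearest_waypoint(path: np.ndarray, target: np.ndarray) -> int:
    """Given the 3-D capsule path (N, 3) recorded during insertion, return the index
    of the path point closest to a selected target for repeated inspection."""
    return int(np.argmin(np.linalg.norm(path - target, axis=1)))

path = np.cumsum(np.random.rand(100, 3), axis=0)    # stand-in recorded trajectory
idx = nearest_waypoint(path, target=path[42] + 0.1) # navigate back toward this point
```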
The wireless capsule endoscopy (WCE) in current use is limited in terms of inspection time and flexibility, since the capsule is passively moved by peristalsis and cannot be accurately positioned. Different methods based on simultaneous magnetic actuation and localization technologies have been proposed to facilitate effective locomotion of WCE. In this work, we study the trajectory-following problem of a robotic capsule under rotating magnetic actuation in a tubular environment, in order to realize safe, efficient, and accurate inspection of the intestine at given points using wireless capsule endoscopy. Specifically, four trajectory-following strategies are developed, based on a PD controller, an adaptive controller, a model predictive controller, and a robust multi-stage model predictive controller. Moreover, our approach accounts for the uncertainties in the intestinal environment by modeling intestinal peristalsis and friction during controller design. We validate our methods in simulation as well as in real-world experiments in various tubular environments, including plastic phantoms of different shapes and an ex-vivo pig colon. The results show that our approach can effectively actuate a reciprocally rotating capsule to follow a desired trajectory in complex tubular environments, thereby enabling accurate and repeatable inspection of the intestine for high-quality diagnosis.
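Of the four strategies, the PD controller is the simplest to illustrate; a textbook PD law with illustrative gains is sketched below (the paper's tuned controllers, and its adaptive and MPC variants, are more involved).

```python
import numpy as np

def pd_control(error: np.ndarray, prev_error: np.ndarray, dt: float,
               kp: float = 1.0, kd: float = 0.1) -> np.ndarray:
    """Classic PD law: command proportional to the tracking error plus its rate of
    change. kp and kd are illustrative, not the gains used in the paper."""
    return kp * error + kd * (error - prev_error) / dt

# e.g., 3-D position error between the capsule and the desired trajectory point
u = pd_control(np.array([1.0, -0.5, 0.2]), np.array([1.1, -0.4, 0.25]), dt=0.02)
```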
A recent study has shown a phenomenon called neural collapse, in which the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial for discriminating the minor classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method can bring significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks 1st and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard. Code will be available at https://github.com/dvlab-research/Imbalanced-Learning.
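A sketch of one plausible form of such a regularizer is shown below: it pushes pairwise cosine similarities between class feature centers toward the simplex-ETF target of -1/(K-1). This is an illustrative construction, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def center_regularizer(features: torch.Tensor, labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Penalize deviation of center-to-center cosine similarities from -1/(K-1),
    the value attained by a simplex equiangular tight frame."""
    centers = torch.zeros(num_classes, features.size(1), device=features.device)
    centers.index_add_(0, labels, features)                    # per-class feature sums
    counts = torch.bincount(labels, minlength=num_classes).float()
    present = counts > 0
    centers = centers[present] / counts[present].unsqueeze(1)  # class means
    k = centers.size(0)
    if k < 2:
        return features.new_zeros(())
    cos = F.normalize(centers, dim=1) @ F.normalize(centers, dim=1).t()
    off_diag = ~torch.eye(k, dtype=torch.bool, device=cos.device)
    return ((cos[off_diag] + 1.0 / (k - 1)) ** 2).mean()
```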
Weakly-supervised object localization aims to indicate the category as well as the scope of an object in an image given only the image-level labels. Most of the existing works are based on Class Activation Mapping (CAM) and endeavor to enlarge the discriminative area inside the activation map to perceive the whole object, yet ignore the co-occurrence confounder of the object and context (e.g., fish and water), which makes it hard for the model to distinguish object boundaries. Besides, the use of CAM also brings a dilemma: classification and localization always suffer from a performance gap and cannot reach their highest accuracy simultaneously. In this paper, we propose a causal knowledge distillation method, dubbed KD-CI-CAM, to address these two under-explored issues in one go. More specifically, we tackle the co-occurrence context confounder problem via causal intervention (CI), which explores the causalities among image features, contexts, and categories to eliminate the biased object-context entanglement in the class activation maps. Based on the de-biased object feature, we additionally propose a multi-teacher causal distillation framework to balance the absorption of classification knowledge and localization knowledge during model training. Extensive experiments on several benchmarks demonstrate the effectiveness of KD-CI-CAM in learning clear object boundaries from confounding contexts and addressing the dilemma between classification and localization performance.
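The multi-teacher balancing can be pictured as a weighted sum of two temperature-scaled distillation terms, one from a classification-oriented teacher and one from a localization-oriented teacher; the weighting scheme below is a hypothetical simplification of the paper's framework.

```python
import torch
import torch.nn.functional as F

def multi_teacher_kd(student_logits, cls_teacher_logits, loc_teacher_logits,
                     alpha: float = 0.5, T: float = 4.0) -> torch.Tensor:
    """Blend knowledge from two teachers via temperature-scaled KL divergence.
    alpha and T are illustrative hyperparameters."""
    def kd(teacher_logits):
        return F.kl_div(F.log_softmax(student_logits / T, dim=1),
                        F.softmax(teacher_logits / T, dim=1),
                        reduction="batchmean") * (T * T)
    return alpha * kd(cls_teacher_logits) + (1 - alpha) * kd(loc_teacher_logits)
```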
Witnessing the impressive achievements of pre-training techniques on large-scale data in the field of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit to mitigate the sample inefficiency problem for visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive irrelevant information for decision making, making predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for policy pretraining in visuomotor driving. We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns the driving policy representation by predicting future ego-motion and optimizing the photometric error based on the current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving-policy-related representations and is thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios have demonstrated the superiority of our proposed approach, where improvements range from 2% to even over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
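The second-stage objective can be grounded in a standard monodepth-style photometric error between the frame synthesized from the predicted pose/depth and the real frame; the SSIM/L1 mix below follows that common recipe and is an assumption about, not a quote of, PPGeo's exact loss.

```python
import torch
import torch.nn.functional as F

def ssim(x: torch.Tensor, y: torch.Tensor, c1: float = 0.01 ** 2, c2: float = 0.03 ** 2):
    """Local SSIM via 3x3 average pooling, as commonly used in self-supervised depth."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    var_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    cov = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def photometric_loss(warped: torch.Tensor, target: torch.Tensor, alpha: float = 0.85):
    """Weighted mix of (1 - SSIM)/2 and L1 between the synthesized and the real view."""
    l1 = (warped - target).abs().mean(1, keepdim=True)
    s = ((1 - ssim(warped, target)) / 2).clamp(0, 1).mean(1, keepdim=True)
    return (alpha * s + (1 - alpha) * l1).mean()
```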
In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. We further design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.
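One plausible shape for the explicit part of the grounding loss is a contrastive alignment between caption-noun embeddings and their matched mask embeddings; the matching indices below are assumed to come from a bipartite assignment, and the whole sketch is illustrative rather than the paper's loss.

```python
import torch
import torch.nn.functional as F

def grounding_loss(mask_embeds: torch.Tensor, noun_embeds: torch.Tensor,
                   match_idx: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Contrastive grounding: each caption noun (K, D) should score highest against
    its matched mask query (M, D); match_idx (K,) is assumed given by matching."""
    logits = F.normalize(noun_embeds, dim=-1) @ F.normalize(mask_embeds, dim=-1).t()
    return F.cross_entropy(logits / temperature, match_idx)
```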
Nearest-Neighbor (NN) classification has been proven to be a simple and effective approach for few-shot learning. The query data can be classified efficiently by finding the nearest support class based on features extracted by pretrained deep models. However, NN-based methods are sensitive to the data distribution and may produce false predictions if the samples in the support set happen to lie around the distribution boundary of different classes. To solve this issue, we present P3DC-Shot, an improved nearest-neighbor-based few-shot classification method empowered by prior-driven data calibration. Inspired by the distribution calibration technique, which utilizes the distribution or statistics of the base classes to calibrate the data for few-shot tasks, we propose a novel discrete data calibration operation that is more suitable for NN-based few-shot classification. Specifically, we treat the prototypes representing each base class as priors and calibrate each support data point based on its similarity to different base prototypes. Then, we perform NN classification using these discretely calibrated support data. Results from extensive experiments on various datasets show that our efficient non-learning-based method can outperform, or at least be comparable to, SOTA methods that need additional learning steps.
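The calibration step can be pictured as pulling each support feature toward its most similar base prototypes; alpha and the top-k selection below are illustrative stand-ins for the paper's discrete calibration rule.

```python
import torch
import torch.nn.functional as F

def calibrate_support(support: torch.Tensor, base_prototypes: torch.Tensor,
                      alpha: float = 0.7, k: int = 2) -> torch.Tensor:
    """Shift each support feature (S, D) toward a similarity-weighted mix of its
    k closest base prototypes (B, D); alpha and k are assumed hyperparameters."""
    sim = F.normalize(support, dim=-1) @ F.normalize(base_prototypes, dim=-1).t()  # (S, B)
    top_val, top_idx = sim.topk(k, dim=1)
    w = torch.softmax(top_val, dim=1)                # weights over the chosen priors
    prior = (w.unsqueeze(-1) * base_prototypes[top_idx]).sum(1)
    return alpha * support + (1 - alpha) * prior     # calibrated support features

# Queries are then classified by nearest neighbor against the calibrated support set.
```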